2 Data
2.1 Data Sources
Data used in this project comes from two sources, namely:
The Australian Electoral Commission (AEC, https://www.aec.gov.au/ ). The national body overseeing and running federal elections, the AEC contains detailed election result records. All results for federal elections held in the 21st century are available online, through their Tally Room website [11].
The Australian Bureau of Statistics (ABS, https://www.abs.gov.au/). The ABS provides a wide range of national statistics and is responsible for conducting a national census of population and housing every 5 years. Comprehensive census data is provided in multiple formats, including CSV files through Census Data Packs [12], available for censuses from 2006 onwards.
Both organisations are the authoritative sources for electoral and statistical data in Australia, and both provide their data openly. Although there are no quality issues, the way the data is provided presents other challenges:
- In both cases, data are provided in large volumes and exhaustive granularity. Data extraction and aggregation can be time-consuming and resource-intensive if not done effectively.
- Census data points are provided using the ABS's own geographical standard, and only a small selection of census data is provided already aggregated for each Commonwealth Electoral Division. Conversion between ABS geographical structures and electoral divisions is not straightforward, as there is no 1:1 correspondence. Both geographical reference systems are modified at each election and each census.
- Despite the best efforts of both organisations in keeping consistency, names of electorates, parties, and census attributes change over time, which requires keeping track of all those changes and mapping them accordingly.
To deal with these issues, it was necessary to write code that guarantees some level of repeatability and consistency when extracting and transforming data. This resulted in three R packages being written to undertake this task:
- {auspol} [13] https://carlosyanez.github.io/auspol/, which extracts and presents electoral results.
- {auscensus} [14] https://carlosyanez.github.io/auscensus/, which allows interaction with Census Data Packs to extract different statistics across geographical units and across censuses.
- {aussiemaps} [15] https://carlosyanez.github.io/aussiemaps/, which assists with aggregating census data into electoral divisions, by matching and apportioning different geographical structures.
Although this project’s scope is data analysis, writing these three packages was a significant task, requiring a large time investment. Appendix B presents a more detailed view of their design principles. Additionally, appendices C, D, and E contain copies of some of their vignettes explaining their use. With their help, it was possible to build a basic data extraction and transformation pipeline, which is represented in figure 2.1.
Figure 2.1: Flow of data from sources to dataset
The extraction process consisted of four steps:
Census data was extracted from the respective Census Data Pack using {auscensus}. Using the package workflow, key attributes were identified in each census, extracted from the respective files and given common names. Data were extracted for statistical areas and apportioned into Commonwealth Electoral Divisions by overlapping area, with the help of functions written into {aussiemaps}.
Primary vote results for each division were extracted using the {auspol} package.
All the data was stored in a local database, from where it was extracted and put together in a single dataset.
From there, the “raw” data was further processed and stored in a single “consolidated” dataset. This dataset has been further refined throughout the data exploration and modelling steps.
Processing the sources into a usable dataset was a particularly arduous task. Although there were no issues in terms of missing or erroneous data, the previously mentioned challenges meant that intensive processing was required. Getting data into the right shape consisted of:
- Obtaining data at the most granular spatial division available in the Census packs (e.g. Statistical Area 1 (SA1) for the 2011, 2016 and 2021 censuses). For reference, there were 57,523 SA1 areas in Australia for the 2016 Census.
- Mapping SA1 data into Commonwealth Electoral Divisions. For SA1s overlapping more than one electoral division, data was apportioned based on overlapping areas.
- Converting all values from absolute numbers (e.g. number of votes, number of people in a particular demographic category) to percentages of the total number of voters (elections) or individuals (census) in the respective electorate.
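The apportioning and percentage steps above can be sketched as follows. This is a minimal Python illustration rather than the {aussiemaps} implementation: the function names, SA1 codes, division names and all figures are hypothetical, and it assumes the overlapping area between each SA1 and each division has already been computed.

```python
from collections import defaultdict


def apportion_by_area(sa1_counts, overlaps):
    """Split each SA1 count across divisions in proportion to overlapping area.

    sa1_counts: {sa1_code: count}
    overlaps:   {sa1_code: {division: overlapping_area}}
    """
    totals = defaultdict(float)
    for sa1, count in sa1_counts.items():
        areas = overlaps[sa1]
        sa1_area = sum(areas.values())
        for division, area in areas.items():
            totals[division] += count * area / sa1_area
    return dict(totals)


def to_percentages(counts, totals):
    """Express each division's count as a percentage of its total."""
    return {division: 100 * counts[division] / totals[division] for division in counts}


# Hypothetical data: SA1_B straddles two divisions (25% / 75% of its area).
renters = {"SA1_A": 120, "SA1_B": 80}
population = {"SA1_A": 400, "SA1_B": 200}
overlaps = {
    "SA1_A": {"Division X": 2.0},
    "SA1_B": {"Division X": 0.5, "Division Y": 1.5},
}

renters_by_division = apportion_by_area(renters, overlaps)
population_by_division = apportion_by_area(population, overlaps)
rented_pct = to_percentages(renters_by_division, population_by_division)
print(rented_pct)
```

Apportioning both the numerator and the denominator with the same area weights keeps the national totals unchanged, which is a convenient sanity check on the conversion.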
2.2 Data Selection and Initial Transformation
A number of considerations shaped the dataset that was eventually used, including which statistics to include, how to represent them, and how best to align census and election data.
2.2.1 How to present the numbers
The first point to consider was how to represent the data in a way that is consistent across electorates and time. As mentioned in the introduction, although the aim behind the creation and geographical distribution of Commonwealth Electoral Divisions is to provide equal representation in Parliament for every Australian, this is not completely possible in practice, resulting in electorates varying in population. This is mainly due to the large variation in population density across Australia, combined with a constitutional mandate to guarantee a minimum number of seats per state or territory. For this reason, all voting and demographic statistics are represented as a percentage of each electorate’s roll or population. This is also useful when comparing statistics across time.
The second point to address is the correspondence between census and election data. Since the census cycle (5 years) does not match the electoral cycle (determined by the incumbent government, with a 3-year term for the House of Representatives), there is a potential problem of the census data not being completely representative of the population on a given election day. Figure 2.2 presents the best matches between both events held in the 21st century.
Figure 2.2: Census and Elections Timeline
Considering the census data available and selecting the election closest to each census, four pairs of events were selected for data extraction. They are presented in table 2.1.
| Census | Election |
|---|---|
| 2006 | 2007 |
| 2011 | 2010 |
| 2016 | 2016 |
| 2021 | 2022 |
Please note that this selection removes half of the election events within the period, which may affect model accuracy. However, since the objective is not to obtain an accurate prediction, this has been accepted as a reasonable trade-off to avoid having to interpolate demographic statistics.
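The pairing rule can be illustrated with a short sketch. It is a simplification that compares calendar years only, not exact census and polling dates; the function name is hypothetical, and the list of election years is an assumption based on the federal elections held since 2001.

```python
CENSUS_YEARS = [2006, 2011, 2016, 2021]
# Assumed list of federal election years in the 21st century.
ELECTION_YEARS = [2001, 2004, 2007, 2010, 2013, 2016, 2019, 2022]


def nearest_election(census_year, election_years):
    """Return the election year closest to a given census year."""
    return min(election_years, key=lambda year: abs(year - census_year))


pairs = {census: nearest_election(census, ELECTION_YEARS) for census in CENSUS_YEARS}
unused = [year for year in ELECTION_YEARS if year not in pairs.values()]
print(pairs)
print(unused)
```

Four of the eight elections are left unpaired, which is the trade-off discussed above.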
2.2.2 Electoral Data
In the case of the electoral data, not much processing was required. The source data already contains records of primary voting for each electorate. The only adjustment was to reclassify the vote into four groups (referred to as parties in this document):
- ALP for the Australian Labor Party.
- COAL representing the Coalition, made up of the Liberal Party, the National Party, the Liberal National Party of Queensland (note 3) and the Country Liberal Party in the Northern Territory (note 4).
- GRN for the Australian Greens.
- Other to collect votes from any other candidates, including minor parties and independents.
A data sample is presented in table 2.2.
| Year | Division | Abbreviation | Party | Votes | Percentage |
|---|---|---|---|---|---|
| 2022 | Canberra | ALP | Australian Labor Party | 34,574 | 45.20 |
| 2022 | Canberra | GRN | The Greens | 19,240 | 25.15 |
| 2022 | Canberra | COAL | Liberal (Coalition) | 16,264 | 21.26 |
| 2022 | Canberra | Other | Other Parties | 6,417 | 8.39 |
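The reclassification can be sketched as below. This is an illustration only, not the {auspol} code: the party abbreviations (`LP`, `NP`, `LNP`, `CLP`, `IND`, `OTH`) are assumptions about how source records might be labelled. The vote totals come from table 2.2, but the split of the "Other" votes between two candidates is hypothetical.

```python
# Assumed abbreviations for the parties that make up the Coalition.
COALITION_CODES = {"LP", "NP", "LNP", "CLP"}


def party_group(code):
    """Map a party abbreviation to one of the four groups used here."""
    if code == "ALP":
        return "ALP"
    if code in COALITION_CODES:
        return "COAL"
    if code == "GRN":
        return "GRN"
    return "Other"


def primary_vote_share(candidate_votes):
    """Aggregate (party_code, votes) pairs into percentages per group."""
    totals = {"ALP": 0, "COAL": 0, "GRN": 0, "Other": 0}
    for code, votes in candidate_votes:
        totals[party_group(code)] += votes
    all_votes = sum(totals.values())
    return {group: round(100 * votes / all_votes, 2) for group, votes in totals.items()}


# Canberra, 2022: group totals as in table 2.2; the IND/OTH split is made up.
canberra_2022 = [("ALP", 34574), ("GRN", 19240), ("LP", 16264), ("IND", 4000), ("OTH", 2417)]
shares = primary_vote_share(canberra_2022)
print(shares)
```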
2.2.3 Census Data
When it comes to Census data, a number of considerations had to be tackled during extraction, namely:
Large volumes of data. Each census collects a large number of statistics. For instance, the data release for the 2021 Census contains 62 different tables, ranging from 8 (note 5) to 1,590 (note 6) attributes.
Data aggregated per electorate. Although the ABS provides statistics for non-ABS geographical structures, this only includes a subset of all data points collected. Thus, in many cases it is necessary to extract data for granular-level ABS units (SA1 in 2021) and aggregate them into electoral divisions. Since the population distribution within each SA1 is not known, values have been approximately apportioned using areas.
Consistency across time. Due to the changing nature of a Census (to better serve its purpose), there are some minor variations in how data is collected and aggregated from Census to Census.
To obtain a first selection of potentially relevant demographic variables to extract, existing literature and journalistic sources were consulted ([1], [2], [3]). Since many variables are collinear by definition (e.g. income groups) or closely related (e.g. age and relationship status), the initial selection was inspected and refined. After several iterations, a final set of 55 attributes was chosen, which can be classed into the following categories:
Income: Distribution of the population in pre-set income brackets. The highest income bracket includes everyone earning 2,000 dollars or more each week.
Education Level: Distribution of educational achievement (from incomplete secondary to vocational education and academic degrees).
Age: Year of birth is captured in the census, which was grouped into generational cohorts. The four groups of interest are Baby Boomers (1946 to 1964), Generation X (1965 to 1980), Generation Y (1981 to 1996) and Generation Z (1997 to 2021).
Relationship status: Variables describing civil status (e.g. living alone, married, in a de facto relationship).
Household type: Descriptors of type of housing (e.g. standalone house, semi-detached, flats).
Household tenure: Descriptors of house ownership, rental or another arrangement (e.g. public housing).
Citizenship: Percentage of the population that hold Australian citizenship. Although non-citizens are not entitled to vote, this variable can be taken as a proxy for the relative integration of migrant communities into civic life.
Religion: Percentage of the population declaring to profess a religion. For this analysis, large and high-growth religious groups were selected. For practical reasons and to use as a potential community proxy, the values of Anglican, Presbyterian and Uniting followers were merged into a single statistic.
Language: Languages spoken in the community. Similar to religion, a selection of relevant languages has been included to reflect historic and current migrant communities.
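As an illustration of how one of these categories can be derived, the sketch below groups counts by year of birth into the four generational cohorts listed under Age and expresses them as percentages. The function names and the example counts are hypothetical, and the sketch assumes the denominator is the total population of the electorate.

```python
# Cohort boundaries as listed under "Age" (inclusive birth years).
COHORTS = [
    ("Baby Boomers", 1946, 1964),
    ("Generation X", 1965, 1980),
    ("Generation Y", 1981, 1996),
    ("Generation Z", 1997, 2021),
]


def cohort_for(birth_year):
    """Return the cohort name, or None if outside the four groups of interest."""
    for name, first, last in COHORTS:
        if first <= birth_year <= last:
            return name
    return None


def cohort_percentages(count_by_birth_year):
    """Percentage of the total population falling in each cohort."""
    total = sum(count_by_birth_year.values())
    shares = {name: 0.0 for name, _, _ in COHORTS}
    for birth_year, count in count_by_birth_year.items():
        name = cohort_for(birth_year)
        if name is not None:
            shares[name] += 100 * count / total
    return shares


# Hypothetical counts for a single electorate.
counts = {1940: 100, 1950: 200, 1970: 300, 1990: 250, 2005: 150}
cohort_shares = cohort_percentages(counts)
print(cohort_shares)
```

Because people born before 1946 are counted in the denominator but belong to no cohort, the four percentages do not need to add up to 100.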
Additionally, each electorate was classified as metropolitan if it lies within the boundaries of an Australian capital city, or as non-metropolitan otherwise. Altogether, these variables try to reflect wealth and education (cited by [1] as key factors influencing political persuasion), as well as the stage in life and belonging to a particular migrant community (sometimes cited as an influential factor, for instance in [3]).
A sample of the resulting dataset is presented in table 2.3. A detailed list of all variables is presented in appendix A.
| Election Year | Division | Australian Citizens | Baby Boomers | Gen Y | Rented | Chinese | Italian | Single Parent | 2000 or more |
|---|---|---|---|---|---|---|---|---|---|
| 2016 | Ballarat | 89.28 | 21.03 | 18.94 | 22.08 | 0.74 | 0.24 | 4.92 | 4.99 |
| 2016 | Berowra | 87.36 | 20.32 | 16.03 | 12.39 | 9.87 | 0.95 | 3.03 | 15.27 |
| 2007 | Braddon | 92.32 | 27.01 | 20.31 | 18.16 | 0.10 | 0.06 | 4.77 | 1.10 |
| 2007 | Charlton | 92.63 | 26.59 | 21.29 | 17.37 | 0.30 | 0.39 | 5.06 | 1.98 |
| 2022 | Chifley | 79.10 | 15.26 | 25.53 | 27.80 | 1.83 | 0.26 | 5.38 | 7.28 |
| 2007 | Deakin | 87.83 | 24.02 | 18.44 | 19.39 | 5.26 | 1.68 | 4.53 | 3.06 |
| 2010 | Dickson | 89.21 | 15.38 | 22.13 | 18.84 | 0.46 | 0.37 | 4.11 | 6.32 |
| 2016 | Dunkley | 85.79 | 20.99 | 19.23 | 23.29 | 0.99 | 0.61 | 5.52 | 6.29 |
| 2010 | Forde | 81.84 | 15.02 | 23.15 | 28.86 | 0.82 | 0.25 | 5.23 | 3.16 |
| 2007 | Franklin | 90.91 | 27.94 | 20.90 | 14.16 | 0.20 | 0.21 | 5.17 | 1.56 |
| 2010 | Goldstein | 85.30 | 17.69 | 18.41 | 22.44 | 2.95 | 1.40 | 3.63 | 14.51 |
| 2022 | Goldstein | 86.29 | 21.45 | 17.03 | 23.33 | 4.74 | 0.99 | 3.73 | 25.12 |
| 2022 | Hawke | 85.08 | 16.98 | 24.50 | 21.33 | 0.66 | 0.73 | 5.20 | 8.56 |
| 2022 | Hughes | 91.96 | 19.48 | 19.62 | 16.63 | 2.54 | 0.57 | 3.77 | 17.88 |
| 2007 | Hume | 91.08 | 27.36 | 20.11 | 18.12 | 0.21 | 0.50 | 4.08 | 2.67 |
| 2016 | Hume | 90.02 | 19.60 | 19.36 | 16.83 | 0.45 | 0.53 | 4.29 | 7.16 |
| 2007 | Lowe | 78.42 | 24.84 | 21.19 | 31.11 | 11.98 | 8.36 | 3.88 | 6.63 |
| 2016 | Makin | 86.19 | 19.88 | 21.67 | 17.17 | 2.13 | 1.27 | 4.83 | 4.21 |
| 2022 | Mitchell | 84.75 | 17.08 | 20.68 | 20.87 | 11.83 | 0.64 | 3.10 | 21.53 |
| 2022 | Paterson | 92.16 | 22.47 | 20.28 | 24.58 | 0.44 | 0.11 | 5.51 | 8.56 |
| 2010 | Petrie | 83.74 | 18.02 | 21.41 | 29.00 | 0.84 | 0.51 | 5.29 | 4.27 |
| 2007 | Robertson | 87.34 | 26.20 | 20.02 | 22.51 | 0.39 | 0.36 | 5.17 | 2.59 |
| 2016 | Wakefield | 86.84 | 17.95 | 20.37 | 23.47 | 0.22 | 0.72 | 6.07 | 2.87 |
| 2007 | Wannon | 92.24 | 27.25 | 20.43 | 18.32 | 0.14 | 0.11 | 3.94 | 1.75 |
| 2007 | Wide Bay | 89.00 | 28.94 | 18.84 | 22.09 | 0.10 | 0.16 | 4.77 | 0.96 |
2.3 Training, Validation and Testing Split
After obtaining the data, the election results and census statistics for the 2021/2022 cycle were set aside to be used as the testing dataset in an election forecast attempt. The remaining data was used for exploratory analysis, data mining, and creating and fitting models.
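A minimal sketch of this hold-out split is shown below. The record layout (`election_year`, `division`) and the function name are hypothetical. The example rows reuse year and division pairs from table 2.3.

```python
def split_by_cycle(rows, test_year=2022):
    """Hold out one election cycle for testing; keep the rest for modelling."""
    train = [row for row in rows if row["election_year"] != test_year]
    test = [row for row in rows if row["election_year"] == test_year]
    return train, test


rows = [
    {"election_year": 2016, "division": "Ballarat"},
    {"election_year": 2007, "division": "Braddon"},
    {"election_year": 2022, "division": "Chifley"},
    {"election_year": 2010, "division": "Dickson"},
    {"election_year": 2022, "division": "Hawke"},
]
train_rows, test_rows = split_by_cycle(rows)
print(len(train_rows), len(test_rows))
```

Splitting by cycle rather than at random keeps every 2022 observation out of the exploration and fitting steps, so the forecast is evaluated on an election the models have not seen.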
2.4 Data Exploration
In total, the resulting dataset is made up of 4 response variables and 55 potential predictors, plus identification attributes like division name and election year. As expected, many covariates exhibit moderate to high collinearity. It is also possible to observe some loose correlation between some of the covariates and some of the responses. The complete list of variables is presented in appendix A.
As an example, figure 2.3 shows a somewhat weak correlation between Coalition primary vote and the percentage of Baby Boomers. Figure 2.4 presents the correlation values for religion and language variables, where it is possible to see:
A positive correlation between monolingual English speakers and membership in Anglican, Presbyterian and Uniting churches. Together, they are likely proxies for the Anglo-Celtic population.
Similarly, there are somewhat expected correlations that most likely indicate concentrations of linguistically and culturally diverse communities, e.g. Hinduism and South Asian languages, Catholicism and Italian, and Buddhism and East Asian languages.
Figure 2.3: Correlation between Coalition vote and Baby boomer population
Figure 2.4: Correlation for selected covariates
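For reference, a pairwise correlation of the kind shown in these figures can be computed as sketched below. The sketch assumes the Pearson coefficient, which the text does not state explicitly. The function name is hypothetical. The values are the 2007 rows of table 2.3 (Baby Boomers and the highest income bracket), used only because they are available in this chapter, so the result is not representative of the full dataset.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# 2007 rows from table 2.3: Braddon, Charlton, Deakin, Franklin, Hume, Lowe.
baby_boomers = [27.01, 26.59, 24.02, 27.94, 27.36, 24.84]
top_income = [1.10, 1.98, 3.06, 1.56, 2.67, 6.63]
r = pearson(baby_boomers, top_income)
print(round(r, 2))
```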
Additionally, after a detailed inspection, it is worth noting that:
There is no apparent change in the relationship between a given covariate and the responses when broken down by state or capital city.
There are no obviously distinguishable differences when splitting results by each election.
2.4.1 Dimensionality reduction using Multiple Factor Analysis
Given the large number of collinear covariates, it is worth exploring whether a reduction in dimensionality could capture the variation in a meaningful way with a more manageable number of variables. To achieve this, multiple factor analysis (MFA) [16] was used. MFA is essentially an extension of Principal Component Analysis that can deal with categorical and numerical variables, as well as variables that belong to groups (e.g. in this case, Income, Religion, Household Tenure, etc.).
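The core idea of MFA for numerical groups can be sketched with NumPy: each group is standardised and divided by its first singular value, so that no single group dominates, and a global PCA is then run on the concatenated blocks. This is a simplified illustration under those assumptions, not the analysis used in this project. It omits categorical variables, and the data, group sizes and the function name `mfa_numeric` are hypothetical.

```python
import numpy as np


def mfa_numeric(groups):
    """Simplified MFA for numerical groups.

    groups: list of (n_samples, n_vars) arrays, one per variable group.
    Returns (scores, eigenvalues, explained_variance_ratio).
    """
    weighted = []
    for block in groups:
        block = np.asarray(block, dtype=float)
        # Standardise every variable within the group.
        z = (block - block.mean(axis=0)) / block.std(axis=0)
        # Scale the group so that its own first singular value equals 1.
        first_singular_value = np.linalg.svd(z, compute_uv=False)[0]
        weighted.append(z / first_singular_value)
    x = np.hstack(weighted)
    # Global PCA on the weighted, concatenated table via SVD.
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    eigenvalues = s ** 2
    explained = eigenvalues / eigenvalues.sum()
    scores = u * s
    return scores, eigenvalues, explained


# Hypothetical data: 30 electorates, 3 income and 4 religion variables.
rng = np.random.default_rng(0)
income = rng.normal(size=(30, 3))
religion = rng.normal(size=(30, 4))
scores, eigenvalues, explained = mfa_numeric([income, religion])
print(np.cumsum(explained))
```

With this weighting, the first global eigenvalue lies between 1 and the number of groups, and values close to the number of groups indicate a dimension shared by all groups.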
The resulting scree plot and cumulative variance are presented in figure 2.5.
Figure 2.5: Scree plot and cumulative variance
Figure 2.6 presents group plots for the 8 most important dimensions. Unfortunately, there is no straightforward interpretation, except for the association between Dimension 2 and the Education variables.
Figure 2.6: Group plots for first 8 dimensions
2.4.2 Electorate segments
Normally, when characterising voters, Australian politicians and political media make a distinction between inner-city voters (touted as wealthy and progressive), suburbia (“middle Australia”), and the bush and outback areas (conservative, “battlers”, “real Australia”). Therefore, it is of interest to explore whether this can be substantiated by demographic attributes, as it may have an impact on primary voting.
Using all demographic variables, a clustering algorithm was applied to identify these segments. Different clustering approaches were trialled, and the final approach was to:
- ignore Census years and pool all records together.
- transform all demographic attributes to represent the difference between each data point and its corresponding national value (in the same year).
- use Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) [17], a density-based hierarchical clustering algorithm.
HDBSCAN is essentially a density-based method that finds areas of common density and then applies a hierarchical analysis of the results to heuristically determine the number of clusters. This is an appropriate method in this scenario, since the number of segments does not have to be specified in advance.
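The transformation applied before clustering can be sketched as follows. The function name and record layout are hypothetical. The electorate values are taken from table 2.3, while the national reference values are made up for illustration. The resulting pooled rows would then be passed to an HDBSCAN implementation, which is not shown here.

```python
def centre_on_national(records, national, attributes):
    """Replace each attribute with its difference from the same-year national value."""
    centred = []
    for record in records:
        reference = national[record["year"]]
        row = {"year": record["year"], "division": record["division"]}
        for attribute in attributes:
            row[attribute] = record[attribute] - reference[attribute]
        centred.append(row)
    return centred


# Electorate values from table 2.3 (Goldstein in two election years).
records = [
    {"year": 2010, "division": "Goldstein", "rented": 22.44, "chinese": 2.95},
    {"year": 2022, "division": "Goldstein", "rented": 23.33, "chinese": 4.74},
]
# Hypothetical national values, for illustration only.
national = {
    2010: {"rented": 25.0, "chinese": 2.0},
    2022: {"rented": 26.0, "chinese": 3.0},
}
centred = centre_on_national(records, national, ["rented", "chinese"])
print(centred)
```

Expressing every attribute relative to the national value of the same year is what allows records from different censuses to be pooled, since nationwide shifts over time are removed before distances are computed.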
This results in 3 distinct clusters of electorates. Figures 2.7 and 2.8 present these clusters on a map for the 2016 election.
Figure 2.7: Clusters for 2016 Election
Figure 2.8: Clusters for 2016 Election - Interactive
These three clusters are:
Cluster 0 seems to mostly contain electorates located in the inner cities, especially in Sydney and Melbourne. These areas tend to be more affluent, either “established” or “gentrified” suburbs. Notably, it also contains the three northernmost, remote electorates.
Cluster 1 comprises all regional areas outside state capitals (with the exception of Hobart in Tasmania).
Cluster 2 largely represents “suburbia”. Among the capital cities, it is more prevalent in Brisbane and Perth.
Revisiting the demographic attributes can help to understand how these clusters differ from each other. A selection of those variables is presented in figure 2.9.
Figure 2.9: Selected attributes, coloured by cluster.
Even though it is possible to find electorates from every cluster across the spectrum for every attribute, it is possible to observe that cluster 0 tends to concentrate areas with significant Millennial, highly educated, and relatively affluent populations. These areas also tend to attract newer migrants (lower numbers of citizens) and therefore have higher percentages of multicultural populations (such as Chinese speakers). Cluster 1 tends to concentrate older people, with lower percentages of tertiary and vocational education and possibly higher proportions of Anglo-Celtic Australians. Cluster 2 seems to sit in the middle of the other two clusters. Adding these findings to the geographical locations seems to confirm there is some element of truth in the stereotypical classification of voters.
3. In Queensland and the Northern Territory, the Liberal and National branches have merged. Elected federal MPs and senators sit with Liberals if they come from an urban area, or the Nationals when they represent a regional/rural/remote electorate.
4. See note 3.
5. Table 02, Selected Medians and Averages.
6. Table 09, Country of Birth of Person by Age by Sex.